4 research outputs found

    Computing fuzzy rough approximations in large scale information systems

    Rough set theory is a popular and powerful machine learning tool. It is especially suitable for dealing with information systems that exhibit inconsistencies, i.e., objects that have the same values for the conditional attributes but a different value for the decision attribute. In line with the emerging granular computing paradigm, rough set theory groups objects together based on the indiscernibility of their attribute values. Fuzzy rough set theory extends rough set theory to data with continuous attributes and detects degrees of inconsistency in the data. Key to this is turning the indiscernibility relation into a gradual relation, acknowledging that objects can be similar to a certain extent. In very large datasets with millions of objects, computing the gradual indiscernibility relation (in other words, the soft granules) is very demanding, both in runtime and in memory. It is, however, required for the computation of the lower and upper approximations of concepts in the fuzzy rough set analysis pipeline. Current non-distributed implementations in R are limited by memory capacity: we found that a state-of-the-art non-distributed implementation in R could not handle 30,000 rows and 10 attributes on a node with 62 GB of memory, which is clearly insufficient to scale fuzzy rough set analysis to massive datasets. In this paper we present a parallel and distributed solution based on the Message Passing Interface (MPI) to compute fuzzy rough approximations in very large information systems. Our results show that our parallel approach scales with problem size to information systems with millions of objects. To the best of our knowledge, no other parallel and distributed solutions have been proposed in the literature for this problem.
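    To make the memory cost concrete: for n objects the gradual indiscernibility relation is an n-by-n matrix, so 30,000 objects already amount to roughly 30,000^2 x 8 bytes, about 7.2 GB, per double-precision copy, and an analysis typically holds several intermediate copies. Below is a minimal single-node NumPy sketch of the quantities involved, assuming per-attribute similarity 1 - |x - y| / range aggregated with min, and the Lukasiewicz implicator and t-norm; these are common choices in the fuzzy rough literature, not necessarily the ones used in the paper.

        # Minimal sketch: fuzzy rough lower/upper approximations (assumed
        # similarity and connectives; illustrative, not the paper's MPI code).
        import numpy as np

        def similarity_matrix(X):
            """Gradual indiscernibility: per-attribute similarity, min-aggregated."""
            n, m = X.shape
            rng = X.max(axis=0) - X.min(axis=0)
            rng[rng == 0] = 1.0                      # guard constant attributes
            R = np.ones((n, n))
            for a in range(m):
                diff = np.abs(X[:, a, None] - X[None, :, a]) / rng[a]
                R = np.minimum(R, 1.0 - diff)        # min over attributes
            return R                                 # n x n: the memory bottleneck

        def approximations(R, A):
            """Lower/upper approximation of fuzzy set A under relation R."""
            # Lukasiewicz implicator I(a, b) = min(1, 1 - a + b)
            lower = np.min(np.minimum(1.0, 1.0 - R + A[None, :]), axis=1)
            # Lukasiewicz t-norm T(a, b) = max(0, a + b - 1)
            upper = np.max(np.maximum(0.0, R + A[None, :] - 1.0), axis=1)
            return lower, upper

        X = np.random.rand(1000, 10)                    # 1,000 objects, 10 attributes
        A = (np.random.rand(1000) > 0.5).astype(float)  # crisp decision class
        lower, upper = approximations(similarity_matrix(X), A)

    The explicit n x n matrix in similarity_matrix is exactly what a distributed solution must avoid storing on a single node.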

    Alcohol use and burden for 195 countries and territories, 1990-2016: a systematic analysis for the Global Burden of Disease Study 2016

    Background: Alcohol use is a leading risk factor for death and disability, but its overall association with health remains complex given the possible protective effects of moderate alcohol consumption on some conditions. With our comprehensive approach to health accounting within the Global Burden of Diseases, Injuries, and Risk Factors Study 2016, we generated improved estimates of alcohol use and alcohol-attributable deaths and disability-adjusted life-years (DALYs) for 195 locations from 1990 to 2016, for both sexes and for 5-year age groups between the ages of 15 years and 95 years and older.

    Methods: Using 694 data sources of individual and population-level alcohol consumption, along with 592 prospective and retrospective studies on the risk of alcohol use, we produced estimates of the prevalence of current drinking, abstention, the distribution of alcohol consumption among current drinkers in standard drinks daily (defined as 10 g of pure ethyl alcohol), and alcohol-attributable deaths and DALYs. We made several methodological improvements compared with previous estimates: first, we adjusted alcohol sales estimates to take into account tourist and unrecorded consumption; second, we did a new meta-analysis of relative risks for 23 health outcomes associated with alcohol use; and third, we developed a new method to quantify the level of alcohol consumption that minimises the overall risk to individual health.

    Findings: Globally, alcohol use was the seventh leading risk factor for both deaths and DALYs in 2016, accounting for 2.2% (95% uncertainty interval [UI] 1.5-3.0) of age-standardised female deaths and 6.8% (5.8-8.0) of age-standardised male deaths. Among the population aged 15-49 years, alcohol use was the leading risk factor globally in 2016, with 3.8% (95% UI 3.2-4.3) of female deaths and 12.2% (10.8-13.6) of male deaths attributable to alcohol use. For the population aged 15-49 years, female attributable DALYs were 2.3% (95% UI 2.0-2.6) and male attributable DALYs were 8.9% (7.8-9.9). The three leading causes of attributable deaths in this age group were tuberculosis (1.4% [95% UI 1.0-1.7] of total deaths), road injuries (1.2% [0.7-1.9]), and self-harm (1.1% [0.6-1.5]). For populations aged 50 years and older, cancers accounted for a large proportion of total alcohol-attributable deaths in 2016, constituting 27.1% (95% UI 21.2-33.3) of total alcohol-attributable female deaths and 18.9% (15.3-22.6) of male deaths. The level of alcohol consumption that minimised harm across health outcomes was zero (95% UI 0.0-0.8) standard drinks per week.

    Interpretation: Alcohol use is a leading risk factor for global disease burden and causes substantial health loss. We found that the risk of all-cause mortality, and of cancers specifically, rises with increasing levels of consumption, and the level of consumption that minimises health loss is zero. These results suggest that alcohol control policies might need to be revised worldwide, refocusing on efforts to lower overall population-level consumption.
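    For readers unfamiliar with the term, "alcohol-attributable" deaths rest on the population attributable fraction (PAF), the share of an outcome's burden that would disappear if exposure were at its minimum-risk level. The sketch below shows only the textbook categorical PAF formula; the prevalences and relative risks are invented for illustration and are not GBD 2016 estimates, whose actual method (continuous exposure distributions, adjustment for tourist and unrecorded consumption, uncertainty propagation) is far more elaborate.

        # Textbook categorical PAF; all numbers below are hypothetical,
        # not GBD 2016 estimates.
        def attributable_fraction(prevalence, relative_risk):
            """PAF = (sum_i p_i * RR_i - 1) / (sum_i p_i * RR_i), RR = 1 for the unexposed."""
            mean_rr = sum(p * rr for p, rr in zip(prevalence, relative_risk))
            return (mean_rr - 1.0) / mean_rr

        # exposure levels: abstainers, <1, 1-2, >2 standard drinks (10 g ethanol) per day
        p  = [0.40, 0.30, 0.20, 0.10]   # hypothetical prevalences (sum to 1)
        rr = [1.00, 1.02, 1.15, 1.60]   # hypothetical relative risks for one outcome
        print(f"attributable fraction: {attributable_fraction(p, rr):.3f}")   # ~0.09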

    Fuzzy Rough Set Approximations in Large Scale Information Systems

    Thesis (Master's)--University of Washington, 2015.

    Rough set theory is a popular and powerful machine learning tool. It is especially suitable for dealing with information systems that exhibit inconsistencies, i.e., objects that have the same values for the conditional attributes but a different value for the decision attribute. In line with the emerging granular computing paradigm, rough set theory groups objects together based on the indiscernibility of their attribute values. Fuzzy rough set theory extends rough set theory to data with continuous attributes and detects degrees of inconsistency in the data. Key to this is turning the indiscernibility relation into a gradual relation, acknowledging that objects can be similar to a certain extent. In very large datasets with millions of objects, computing the gradual indiscernibility relation (in other words, the soft granules) is very demanding, both in runtime and in memory. It is, however, required for the computation of the lower and upper approximations of concepts in the fuzzy rough set analysis pipeline. In this thesis, we present a parallel and distributed solution, implemented on both Apache Spark and the Message Passing Interface (MPI), to compute fuzzy rough approximations in very large information systems. Our results show that our parallel approach scales with problem size to information systems with millions of objects. To the best of our knowledge, no other parallel and distributed solutions have been proposed in the literature for this problem. We also present two distributed prototype selection approaches based on fuzzy rough set theory and couple them with our distributed implementation of the well-known weighted k-nearest neighbors prediction technique to solve regression problems. In addition, we show how our distributed approaches can be used on the State Inpatient Data Set (SID) and the Medical Expenditure Panel Survey (MEPS) to predict the total healthcare expenses of patients.
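    Because the lower and upper approximations are min- and max-reductions over the second object index, the full n x n relation never has to be materialized at once: partial results over column blocks can be combined with elementwise min/max, which is what makes a Spark or MPI distribution natural. The sketch below illustrates that blocked evaluation on one machine, under the same assumed similarity and Lukasiewicz connectives as above; it is a generic illustration, not the thesis's actual Spark/MPI implementation, and block_size is an arbitrary choice.

        # Blocked evaluation: O(n * block_size) memory instead of O(n^2).
        # Illustrative only; not the thesis's Spark/MPI code.
        import numpy as np

        def approximations_blocked(X, A, block_size=1000):
            n, m = X.shape
            rng = X.max(axis=0) - X.min(axis=0)
            rng[rng == 0] = 1.0
            lower = np.full(n, 1.0)
            upper = np.full(n, 0.0)
            for start in range(0, n, block_size):    # stream blocks of objects y
                Y = X[start:start + block_size]
                a = A[start:start + block_size]
                R = np.ones((n, len(Y)))
                for j in range(m):                   # similarity to this block only
                    R = np.minimum(R, 1.0 - np.abs(X[:, j, None] - Y[None, :, j]) / rng[j])
                lower = np.minimum(lower, np.min(np.minimum(1.0, 1.0 - R + a[None, :]), axis=1))
                upper = np.maximum(upper, np.max(np.maximum(0.0, R + a[None, :] - 1.0), axis=1))
            return lower, upper

    In a distributed run, each worker would own one or more such blocks and the partial lower/upper vectors would be reduced with elementwise min/max.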

    Distributed fuzzy rough prototype selection for big data regression

    The size and complexity of Big Data require advances in machine learning algorithms to learn adequately from such data. While distributed shared-nothing architectures (Hadoop/Spark) are becoming increasingly popular for developing such new algorithms, adapting existing machine learning algorithms to them is quite challenging. In this paper, we propose a solution for big data regression, where the aim is to learn a regression model over large, high-dimensional datasets. First, a new distributed implementation of the weighted kNN regression method is presented, followed by a novel distributed prototype selection method based on fuzzy rough set theory. Experiments demonstrate that our Apache Spark implementations of the proposed distributed algorithms handle the size and complexity of modern real-world datasets well. We furthermore show that applying our prototype selection method improves the regression accuracy.
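    The usual shared-nothing pattern for kNN is a map step in which each partition returns its local k nearest training points for a query, followed by a reduce step that merges those candidates into the global top k. The pure-Python sketch below illustrates that pattern with inverse-distance weighting; the partitioning, the 1/d weighting scheme, and all names are assumptions for illustration, not the paper's Spark implementation.

        # Map/reduce pattern for weighted kNN regression; illustrative only.
        import heapq
        import numpy as np

        def local_topk(part_X, part_y, q, k):
            """Map: the k nearest neighbours of q within one partition."""
            d = np.linalg.norm(part_X - q, axis=1)
            idx = np.argsort(d)[:k]
            return list(zip(d[idx], part_y[idx]))

        def knn_regress(partitions, q, k=5):
            """Reduce: merge local candidates, predict a distance-weighted mean."""
            candidates = [c for X, y in partitions for c in local_topk(X, y, q, k)]
            nearest = heapq.nsmallest(k, candidates, key=lambda c: c[0])
            w = np.array([1.0 / (d + 1e-12) for d, _ in nearest])  # inverse-distance weights
            ys = np.array([y for _, y in nearest])
            return float(np.sum(w * ys) / np.sum(w))

        gen = np.random.default_rng(0)
        parts = [(gen.random((50, 4)), gen.random(50)) for _ in range(3)]  # 3 mock partitions
        print(knn_regress(parts, q=gen.random(4), k=5))

    A fuzzy rough prototype selection step, as proposed in the paper, would shrink each partition's training points before this query phase; its details are not reproduced here.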